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(54) Title: SPECTRAL SUBTRACTION NOISE SUPPRESSION METHOD 
(57) Abstract 

A spectral subtraction noise suppression method in a frame 
based digital communication system is described. Each frame includes 
a predetermined number N of audio samples, thereby giving each 
frame N degrees of freedom. The method is performed by a spectral 
subtraction (150) function A(w) which is based on an estimate (140) 
* v(w ) of the power spectral density of background noise of non-speech 
frames and an estimate (130) *.(«) of the power spectral density 
of speech frames. Each speech frame is approximated (120) by a 
parametric model that reduces the number of degrees of freedom to 
less than N. The estimate $,(w) of the power spectral density of each 
speech frame is estimated (130) from the approximative parametric 
model. 
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SPECTRAL SUBTRACTION NOISE SUPPRESSION METHOD 

TECHNICAL FIELD 
The present invention relates to noise suppresion in digital frame b^ed communication 
systems, and to particular «. a spectral subtraction noise suppression method ■„ such 

systems. 

BACKGROUND OF THE INVENTION 
A common problem in speech signal processing is the enhancement of a speech signal 
from its noisy measurement. One approach for speech enhancement based on single 
channel (microphone) measurements is filtering in the frequency don«nn applying spectral 
subtraction techniques, HI, PL Under the assumption that the background noise is long- 
time stations (in comparison with the speech) a mode, of the background noise is usuaUy 
estimate during time intervals with non-speech activity. Then, during data frames ..th 
speech activity, this estimated noise mode! is used together with an estimated model of 
the noisy speech in order to enhance the speech. For the spectral subtraction technroues 
these models are traditionally given in terms of the Power Spectral Density (PSD), that 
is estimated using classical FFT methods. 

None of the abovementioned techniques give in their basic form an output signal with 
satisfactory audible quality in mobile" telephony applications, that is 

1. non distorted speech output 

2. sufficient reduction of the noise level 

3. remaining noise without annoying artifacts 

In particular the spectral subtraction methods are known to violate I when 2 is fulfilled 
or violate 2 when 1 is fulfilled. In addition, in most cases 3 is more or less violated smce 
the methods introduce, so called, musical noise. 

The above drawbacks with the spectral subtraction methods have been known and, 
in the literature, several ad hoc modifications of the basic algorithms have appeared 
for particular speech-in-noise scenarios. However, the problem how to design a spectral 
subtraction method that for general scenarios Mils 1-3 has remained unsolved. 
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In order to highlight the difficulties with speech enhancement from noisy data, note 
that the spectral subtraction methods are based on filtering using estimated models of the 
incoming data. If those estimated models are close to the underlying "true" models, this 
is a well working approach. However, due to the short time stationary of the speech (10- 
40 ms) as well as the physical reality surrounding a mobile telephony application (8000Hz 
sampling frequency, 0.5-2.0 s stationary of the noise, etc.) the estimated models are 
likely to significantly differ from the underlying reality and, thus, result in a filtered 
output with low audible quality. 

EP, Al, 0 588 526 describes a method in which spectral analysis is performed either 
with Fast Fourier Transformation (FFT) or Linear Predictive Coding (LPC). 

SUMMARY OF THE INVENTION 

An object of the present invention is to provide a spectral subtraction noise suppresion 
method that gives a better noise reduction without sacrificing audible quality. 
This object is solved by the characterizing features of claim 1. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention, together with further objects and advantages thereof, may best be 
understood by making reference to the following description taken together with the 
accompanying drawings, in which: 

FIGURE 1 is a block diagram of a spectral subtraction noise suppression system 
suitable for performing the method of the present invention; 

FIGURE 2 is a state diagram of a Voice Activity Detector (VAD) that may be used 
in the system of Fig. 1; 

FIGURE 3 is a diagram of two different Power Spectrum Density estimates of a speech 
frame; 

FIGURE 4 is a time diagram of a sampled audio signal containing speech and back- 
ground noise; 

FIGURE 5 is a time diagram of the signal in Fig. 3 after spectral noise subtraction 
in accordance with the prior art; 

FIGURE 6 is a time diagram of the signal in Fig. 3 after spectral noise subtraction 
in accordance with the present invention; and 

FIGURE 7 is a flow chart illustrating the method of the present invention. 
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DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

THE SPECTRAL SUBTRACTION TECHNIQUE 
Consider a frame of speech degraded by additive noise 

X (k) = s(k) + v{k) k = l N 

where ,(*), .(*) and .(Jfc) denote, respectively, the noisy measurement of the speech, the 
speech and the additive noise, and N denotes the number of samples in a frame. 

The speech is assumed stationary over the frame, while the noise is assumed long- 
time stationary, that is stationary over several frames. The number of frames where 
. is stationary is denoted by r » 1. Further, it is assumed that the speech activity . 
sufficiently low.sothatamodel of the noise can be accurately estimated during non-speech 

activity. 

Denote the power spectral densities (PSDs) of, respectively, the measurement, the 
speech and the noise by *,(«), *,(«) and ♦.(«), where 

Know,ng*,M and the pities *.M - ■ W « b ° —* * —•—*»' 

spectral subtraction methods, cf [2|, shortly reviewed below 
Let denote an estimate of s(t). Then, 

«*) = J-'(//(u.)X(u)) 

(3) 



where H ) denotes some Unear transform, for example the Discrete Fourier Transform 
(DFT) and where »H is • real-valued even function in w € (0,2.) and such that 
0 < HW) < 1 The function *M depends on *.M and *„M- Since HM is real- 
Jned.thephaseofSM = »H« equals the phase of the degraded speech. The use 
of real-valned Jf W is motiva«d by the human ears unsensitivity for phase distortron. 

In general, *,M and KM are unknown and have to be replaced in MM by est,- 
mated quantities *,M and **,). Due to the non-stationarity of the speech, W . 
estmuted from a single frame of daw, while *„M * estimated using data in r speech free 
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frames. For simplicity, it is assumed that a Voice Activity Detector (VAD) is available in 
order to distinguish between frames containing noisy speech and frames containing noise 
only. It is assumed that *„(a>) is estimated during non-speech activity by averaging over 
several frames, for example, using 

<M")' = P$v(")'- 1 + (1 - (4) 

In (4), * v (w)' is the (running) averaged PSD estimate based on data up to and including 
frame number I and * v (u>) is the estimate based on the current frame. The scalar p e (0, 1) 
is tuned in relation to the assumed stationary of v(k). An average over r frames roughly 
corresponds to p implicitly given by 

2 

—p= T ( 5 ) 

A suitable PSD estimate (assuming no apriori assumptions on the spectral shape of the 
background noise) is given by 

= ^Mv*M (6) 

where denotes the complex conjugate and where V(u) = F{v{k)). With T( ) = 
FFT(-) (Fast Fourier Transformation), $ v (w) is the Periodigram and 4 v (w) in (4) is the 
averaged Periodigram, both leading "to asymptotically {N » 1) unbiased PSD estimates 
with approximative variances 



Var(* v (u;)) « <S>l(u,) 
Var(*.(w)) « 

T 

A similar expression to (7) holds true for * x (u;) during speech activity (replacing $l(u>) 
in (7) with **(u>)). 

A spectral subtraction noise suppression system suitable for performing the method 
of the present invention is illustrated in block form in Fig. 1. Prom a microphone 10 
the audio signal x{t) is forwarded to an A/D converter 12. A/D converter 12 forwards 
digitized audio samples in frame form {x(A:)} to a transform block 14, for example a 
FFT (Fast Fourier Transform) block, which transforms each frame into a corresponding 



frequencv transformed frame {*<«)}. The transformed frame is filtered by in block 
16 This step performs the actual spectral subtraction. The resulting signal {S(«)> is 
transformed back to the time domain by an inverse transform block 18. The result is a 
frame {*(*)} in which the noise has been suppressed. This frame may be forwarded to 
an echo canceler 20 and thereafter to a speech encoder 22. The speech encoded signal is 
then forwarded to a channel encoder and modulator for transmission (these elements are 
not shown). 

The actual form of («) in block 16 depends on the estimates *,(«), which 
are formed in PSD estimator 24, and the analytical expression of these estimates that 
is used. Examples of different expressions are given in Table 2 of the next section. The 
major part of the following description will concentrate on different methods of forming 
estimates 4,(u/), *•(«) from the input frame {x(fc)}. 

PSD estimator 24 is controUed by a Voice Activity Detector (VAD) 26, which uses 
input frame {x(Jk)} to determine whether the frame contains speech (S) or background 
noise (B). A suitable VAD is described in [5], [6]. The VAD may be implemented as 
a state machine having the 4 states illustrated in Fig. 2. The resulting control signal 
S/B is forwarded to PSD estimator 24. When VAD 26 indicates speech (S). states 21 
and 22 PSD estimator 24 will form *,(«-). On the other hand, when VAD 26 indicates 
non-speech activity (B). state 20, PSD estimator 24 will form The latter estimate 

will be used to form during the next speech frame sequence (together with ♦,(*) 
of each of the frames of that sequence). 

Signal S/B is also forwarded to spectral subtraction block 16. In this way block 16 
mav apply different filters during speech and non-speech frames. During speech frames 
H(u) is the above mentioned expression of * x (*), ♦.(»)• On the other hand, during 
non-speech frames If («) may be a constant H (0 < H < D that reduces the background 
sound level to the same level as the background sound level that remains in speech frames 
after noise suppression. In this way the perceived noise level will be the same during both 
speech and non-speech frames. 

Before the output signal J(fc) in (3) is calculated, H{») - a preferred embodi- 

ment, be post filtered according to 

ff p M = max(0.1,W( W )HM) Vw (») 
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Table 1: The postfiltering functions. 

STATE (st) H{oj) COMMENT 



0 


1 (Vwj 


s(k) = x(k) 


20 


0.316 (Vw) 


muting -lOdB 


21 


0.7 H(u>) 


cautios filtering (-3dB) 


22 


H(u>) 
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where H(u>) is calculated according to Table 1. The scalar 0.1 implies that the noise floor 
is -20dB. 

Furthermore, signal S/B is also forwarded to speech encoder 22. This enables different 
encoding of speech and background sounds. 

PSD ERROR ANALYSIS 

It is obvious that the stationarity assumptions imposed on s(k) and v(k) give rise to 
bounds on how accurate the estimate s(k) is in comparison with the noise free speech signal 
$[k). In this Section, an analysis technique for spectral subtraction methods is introduced. 
It is based on first order approximations of the PSD estimates <E> x (u/) and, respectively, 
(see (11) below), in combination with approximative (zero order approximations) 
expressions for the accuracy of the introduced deviations. Explicitly, in the following an 
expression is derived for the frequency domain error of the estimated signal s(k) y due to 
the method used (the choice of transfer function H(uj)) and due to the accuracy of the 
involved PSD estimators. Due to the human ears unsensitivity for phase distortion it is 
relevant to consider the PSD error, defined by 

= <M<*>) - <M") (9) 

where 

4> a (o,) = J/ 2 H*^u/) (10) 

Note that $,(u/) by construction is an error term describing the difference (in the frequency 
domain) between the magnitude of the filtered noisy measurement and the magnitude of 
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the speech. Therefore, can take both positive and negative values and is not the 

PSD of any time domain signal. In (10), H(u) denotes an estimate of JST («) based on *,(«) 
and In this Section, the analysis is restricted to the case of Power Subtraction 

(PS), (2). Other choices of H{u>) can be analyzed in a similar way (see APPENDIX A-C). 
In addition novel choices of H{u>) are introduced and analyzed (see APPENDIX D-G). A 
summary of different suitable choices of H{u) is given in Table 2. 

Table 2: Examples of different spectral subtraction methods: Power Sub- 
traction (PS) (standard PS. H PS (») for 6 = 1), Magnitude Sub- 
traction (MS), spectral subtraction methods based on Wiener Fil- 
tering (WF) and Maximum Likelihood (ML) methodologies and 
Improved Power Subtraction (IPS) in accordance with a preferred 
embodiment of the present invention. 

H(«) 



£«j>s(«) - ^/l - «♦•(*)/*«(«•») 



J* MSM = 1 " y/*v(»)t**i") 
Hwf(>») = B% s (u) 
WMt(w) = 5(1 + # psM) 
HipsM = \/g(u)Hps(o>) 



By definition, //(u,) belongs to the interval 0 < H{u>) < 1, which not necesarilly 
holds true for the corresponding estimated quantities in Table 2 and, therfore, in practice 
half-wave or full-wave rectification, [1], is used. 

In order to perform the analysis, assume that the frame length N is sufficiently large 
{N » 1) so that *,(«) and •„(*) are approximately unbiased. Introduce the first order 
deviations 

*«(«•>) = **(") + M<") 

(11) 
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*v(w) = *„(«;) + A v (w) 

where A x (u/) and A„(w) are zero-mean stochastic variables such that 
E[A x [u)/* x (u)]* « l and E[A v (<j)/$ v (u>)) 2 <£ 1. Here and in the sequel, the notation 
E[ ) denotes statistical expectation. Further, if the correlation time of the noise is short, 
compared to the frame length, £[(*„(u/)' - *„(«)) (* V M* - $„(«,.))] « 0 for t ji k, where 
is the estimate based on the data in the *-th frame. This implies that A r (w) 
and A„(w) are approximately independent. Otherwise, if the noise is strongly correlated, 
assume that * v (u) has a limited (<r N) number of (strong) peaks located at frequencies 
wi, • • • , Then, El(* v (u>)'-* v (u,)) (*„(u,) k -* v (u,))l « 0 holds for v # Uj j = 1, .... n 
and f ^ A: and the analysis still holds true for u> # \j = 1, . . . , n . 

Equation (11) implies that asymptotical (N » 1) unbiased PSD estimators such as 
the Periodogram or the averaged Periodogram are used. However, using asymptotically 
biased PSD estimators, such as the Blackman-Turkey PSD estimator, a similar analysis 
holds true replacing (11) with 

= $x(u>) + A x {u) + B x {ui) 

and 

<Mw) = $ v [u) + A„(w) + B v (u>) 

where, respectively, B x (u>) and B v (cj) are deterministic terms describing the asymptotic 
bias in the PSD estimators. 

Further, equation (11) implies that 4 a (u>) in (9) is (in the first order approximation) 
a linear function in A x (u>) and A„(w). In the following, the performance of the different 
methods in terms of the bias error (E[$ 9 (u)\) and the error variance (Var($ a (u>))) are 
considered. A complete derivation will be given for H PS {u) in the next section. Similar 
derivations for the other spectral subtraction methods of Table 1 are given in APPENDIX 
A-G. 

ANALYSIS OF H PS {u>) {H SPS (u>) for 6 = 1) 

Inserting (10) and H PS (u>) from Table 2 into (9), using the Taylor series expansion 
(1 + x) -1 — 1 x and neglecting higher than first order deviations, a straightforward 
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calculation gives 

* x (w) 

where is used to denote an approximate equality in which only the dominant terms 
are retained. The quantities and A„M are zero-mean stochastic variables. Thus, 

~ 0 < 13 ) 

and 

Var(* s ( W )) =f |||^Var(* I ( W )) + V«r(*.H) (M) 

In order to continue we use the-general result that, for an asymptotically unbiased spectral 
estimator cf (7) 

Var(4(u;)) = 7M* 2 M (15) 

for some (possibly frequency dependent) variable y(u>). For example, the Periodogram 
corresponds to « 1 + (sinu,* /AT sin*) 2 , which for N » 1 reduces to 7 « 1. 
Combining (14) and (15) gives 

Var(4 a (u/))^7*vM < l6) 

RESULTS FOR Hms{") 

Similar calculations for H MS (») give (details are given in APPENDIX A): 



£[*,(")] ~ 2<M") ^">]i^)J 



and 



Var(*,(«,)) = (l - >] 1 + f^j) 



RESULTS FOR J*wfM 

Calculations for HwfW) give (details are given in APPENDIX B): 



and 



Var(^( w ))^4(l-^M) 2 7 ^ H 

RESULTS FOR H ML (u) 

Calculations for H ML (u) give (details are given in APPENDIX C): 

and 



RESULTS FOR H JPS (u) 

Calculations for H ips (uj) give (Hi PS (u>) is derived in APPENDIX D and analyzed 
APPENDIX E): 



E[* 3 (u>)) ~ (G(cj) - l)$ s ( w ) 

and 



Var(*,(u;)) ~ (? 2 ( w ) 



COMMON FEATURES 
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For the considered methods it is noted that the bias error only depends on the choice 
of »(«). while the error variance depends both on the choice of and the variance of 
the PSD estimators used. For example, for the averaged Periodogram estimate of *„(«) 
one has, from (7), that * * 1/r. On the other hand, using a single frame Periodogram 
for the estimation of ».(«), one has 7 , « 1. Thus, fcr r > 1 the dominant term » 
T = 7l + 7v , appearing in the above vriance equations, is lx and thus the main error 
source is the single frame PSD estimate based on the the noisy speech. 

From the above remarks, it follows that in order to improve the spectral subtraction 
techniques, it is desirable to decrease the value of 7l (select an appropriate PSD estimator, 
that is an approximately unbiased estimator with as good performance as possible) and 
select a "good" spectral subtraction technique (select *(»))■ A key idea of the present 
invention is that the value of 7 * can be reduced using physical modeling (reducing the 
number of degrees of freedom from N (the number of samples in a frame) to a value less 
than N) of the vocal tract. It is well known that «(fc) can be accurately described by an 
autoregressive (AR) model (typically of order p « 10). This is the topic of the next two 
sections. 

In addition, the accuracy of *.(*) (and, implicitly, the accuracy of s(k)) depends 
on the choice of H{u). New, preferred choices of H(u) are derived and analyzed in 
APPENDIX D-G. 

SPEECH AR MODELING 

In a preferred embodiment, of the present invention s(k) is modeled as an autoregressive 

(AR) process 

*( fc ) = ^(*) ' = 

where A^ 1 ) is a monic (the leading coefficient equals one) p-th order polynomial in the 
backward shift operator (g"M*) = «(* ~ *)» etc ) 

A{r *) = l + a l9 " 1 + - (18) 

and «(*) is white zero-mean noise with variance At a first glance, it may seem re- 
strictive to consider AR models only. However, the use of AR models for speech modehng 
is motivated both from physical modeling of the vocal tract and, which is more important 



here, from physical limitations from the noisy speech on the accuracy of the estimated 
models. 

In speech signal processing, the frame length N may not be large enough to allow 
application of averaging techniques inside the frame in order to reduce the variance and, 
still, preserve the unbiasness of the PSD estimator. Thus, in order to decrease the effect 
of the first term in for example equation (12) physical modeling of the vocal tract has to 
be used. The AR structure (17) is imposed onto s(k). Explicitly, 

•^"Pi^jp +*•(«) (19) 
In addition, Q v (u/) may be described with a parametric model 

where B(q~ x ), and Cfa" 1 ) are, respectively, 9 -th and r-th order polynomials, defined 
similarly to A{q~ l ) in (18). For simplicity a parametric noise model in (20) is used in 
the discussion below where the order of the parametric model is estimated. However, it 
is appreciated that other models of background noise are also possible. Combining (19) 
and (20), one can show that 

X(k) = Aiq-hl-if W (21) 

where v (k) is zero mean white noise with variance a\ and where D{q~ l ) is given by the 
identity 

al\D{en\ 2 = °l\C{enf + a 2 |f*(0| 2 M(0| 2 (22) 

SPEECH PARAMETER ESTIMATION 

Estimating the parameters in (17)-(18) is straightforward when no additional noise is 
present. Note that in the noise free case, the second term on the right hand side of (22) 
vanishes and, thus, (21) reduces to (17) after pole-zero cancellations. 

Here, a PSD estimator based on the autocorrelation method is sought. The motivation 
for this is fourfold. 

• The autocorrelation method is well known. In particular, the estimated parameters 
are minimum phase, ensuring the stability of the resulting filter. 



. Using the Levinson algorithm, the method is easily implemented and has a low 

computational complexity. 
• An optimal procedure includes a nonlinear optimization, explicitly requiring some 

initialization procedure. The autocorrelation method requires none. 

.. From a practical point of view, it is favorable if the same estimation procedure 
can be used for the degraded speech and, respectively, the clean speech when it 
is available. In other words, the estimation method should be independent of the 
actual scenario of operation, that is independent of the speech-to-noise ratio. 

It is well known that an ARMA model (such as (21)) can be modeled by an infinite 
order AR process. When a finite number of data are available for parameter estimation, 
the infinite order AR model has to be truncated. Here, the model used is 

where Fig- 1 ) is of order p. An appropriate model order follows from the discussion below. 
The approximative model (23) is close to the speech in noise process if their PSDs are 
approximately equal, that is 

\D{e^)\ 2 „ 1 (24) 
|A(e-)| 2 |C(e")| 2 ~ |F(e-)| 2 

Based on the physical modeling of the vocal tract, it is common to consider p = 
deg{A(q~ 1 )) = 10. From (24) it also follows that p = deg(F( 9 - 1 ) » deg(A{q~ 1 )) + 
deg(C(g- 1 )) = P + r, where p + r roughly equals the number of peaks in $,(0,). On the 
other- hand, modeling noisy narrow band processes using AR models requires p « N in 
order to ensure realible PSD estimates. Summarizing, 

P + t «: p < N 

A suitable rule-of-thumb is given by p ~ yfti. From the above discussion, one can expect 
that a parametric approach is fruitful when N » 100. One can also conclude from (22) 
that the flatter the noise spectra is the smaller values of N is allowed. Even if p is not 
large enough, the parametric approach is expected to give reasonable results. The reason 
for this is that the parametric approach gives, in terms of error variance, significantly 



more accurate PSD estimates than a Periodogram based approach (in a typical example 
the ratio between the variances equals 1:8; see below), which significantly reduce artifacts 
as tonal noise in the output. 

The parametric PSD estimator is summarized as follows. Use the autocorrelation 
method and a high order AR model (model order p » p and p ~ \/N) in order to 
calculate the AR parameters {/j, . . . J p ] and the noise variance a\ in (23). From the 
estimated AR model calculate (in N discrete points corresponding to the frequency bins 
of X(lj) in (3)) * x (w) according to 



*-H = -F^ (25) 

Then one of the considered spectral subtraction techniques in Table 2 is used in order to 
enhance the speech s(k). 

Next a low order approximation for the variance of the parametric PSD estimator 
(similar to (7) for the nonparametric methods considered) and, thus, a Fourier series ex- 
pansion of s(k) is used under the assumption that the noise is white. Then the asymptotic 
(for both the number of data (N » 1) and the model order (p » 1)) variance of $ x (u>) 
is given by 

Var(^(u,)) ^ f * x (") (26) 

The above expression also holds true for a pure (high-order) AR process. From (26), it 
directly follows that y x « 2p/A\ that, according to the aforementioned rule-of- thumb, 
approximately equals 7x cs 2/y/N, which should be compared with 7 X « 1 that holds true 
for a Periodogram based PSD estimator. 

As an example, in a mobile telephony hands free environment, it is reasonable to 
assume that the noise is stationary for about 0.5 s (at 8000 Hz sampling rate and frame 
length N = 256) that gives t % 15 and, thus, ^ v ~ 1/15. Further, for p = y/W we have 
7x = 1/8. 

Fig. 3 illustrates the difference between a periodogram PSD estimate and a parametric 
PSD estimate in accordance with the present invention for a typical speech frame. In this 
example N=256 (256 samples) and an AR model with 10 parameters has been used. It is 
noted that the parametric PSD estimate <Mui) is much smoother than the corresponding 
periodogram PSD estimate. 
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Fig. 4 illustrates 5 seconds of a sampled audio signal containing speech in a noisy 
background. Fig. 5 illustrates the signal of Fig. 4 after spectral subtraction based on a 
periodogram PSD estimate that gives priority to high audible quality. Fig. 6 illustrates 
the signal of Fig. 4 after spectral subtraction based on a parametric PSD estimate in 
accordance with the present invention. 

A comparison of Fig. 5 and Fig. 6 shows that a significant noise suppression (of the 
order of 10 dB) is obtained by the method in accordance with the present invention. (As 
was noted above in connection with the description of Fig. 1 the reduced noise levels 
are the same in both speech and non-speech frames.) Another difference, which is not 
apparent from Fig. 6, is that the resulting speech signal is less distorted than the speech 
signal of Fig. 5. 

The theoretical results, in terms of bias and error variance of the PSD error, for all 
the considered methods are summarized in Table 3. 

It is possible to rank the different methods. One can, at least, distinguish two criteria 
for how to select an appropriate method. 

First, for low instantaneous SNR, it is desirable that the method has low variance in 
order to avoid tonal artifacts in s(k). This is not possible without an increased bias, and 
this bias term should, in order to suppress (and not amplify) the frequency regions with 
low instantaneous SNR, have a negative sign (thus, forcing $ s (u>) in (9) towards zero). 
The candidates that fulfill this criterion are, respectively, MS, IPS and WF. 

Secondly, for high instantaneous SNR, a low rate of speech distortion is desirable. 
Further if the bias term is dominant, it should have a positive sign. ML, 6PS, PS, IPS 
and (possibly) WF fulfill the first statement. The bias term dominates in the MSE 
expression only for ML and WF, where the sign of the bias terms are positive for ML 
and, respectively, negative for WF. Thus, ML, 5PS, PS and IPS fulfill this criterion. 

ALGORITHMIC ASPECTS 

In this section preferred embodiments of the spectral subtraction method in accordance 
with the present invention are described with reference to Fig. 7. 

1. Input: x= {x(k)\k= 

2. Design variables 
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Table 3: Bias and variance expressions for Power Subtraction (PS) (stan- 
dard PS, Hpsfa) for 6 = 1), Magnitude subtraction (MS), Im- 
proved Power Subtraction (IPS) and spectral subtraction meth- 
ods based on Wiener Filtering (WF) and Maximum Likelihood 
(ML) methodologies. The instantaneous SNR is defined by SNR= 
$j(w)/$ v (")- For PS, the optimal subtraction factor 6 is given 
by (58) and for IPS, G(w) is given by (45) with * x (u/) and & v (w) 
there replaced by, respectively, $ x (<->) and $ v (<*0* 

H{u) Bias Variance 



E&.HJ/Mu,) 



Var(* # (w))/ 7 »2(u;) 



6PS 



1-6 



MS 



-2(v/l + SNR-l) 



(n/I + SNR-1) 2 



IPS 



tSNR 
7+SNR a 




WF 






2 



ML 
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p speech-in-noise model order 

p running average update factor for $ v (u>) 



3. For each frame of input data do: 

(a) Speech detection (step 110) 

The variable Speech is set to true if the VAD output equals st = 21 or st = 22. 
Speech is set to false if st « 20. If the VAD output equals st = 0 then the 
algorithm is reinitialized. 

(b) Spectral estimation 

If Speech estimate * x (^) : 

i. Estimate the coefficients (the polynomial coefficients {f u ■ ■ - , fp] the 
variance &$) of the all-pole model (23) using the autocorrelation method 
applied to zero mean adjusted input data {x(k)} (step 120). 

ii. Calculate according to (25) (step 130). 
else estimate 4> v (u>) (step 140) 

i. Update the background noise spectral model $ v (w) using (4), where $ v (u>) 
is the Periodogram based on zero mean adjusted and Hanning/Hamming 
windowed input, data x. Since windowed data is used here, while $ x (u>) 
is based on unwindowed data, $ v (u) has to be properly normalized. A 
suitable initial value of * v {u>) is given by the average (over the frequency 
bins) of the Periodogram of the first frame scaled by, for example, a factor 
0.25, meaning that, initially, a apriori white noise assumption is imposed 
on the background noise, 
(c) Spectral subtraction (step 150) 

i. Calculate the frequency weighting function H{u>) according to Table 1. 

ii. Possible postfiltering, muting and noise floor adjustment. 

iii. Calculate the output using (3) and zero-mean adjusted data {x(fc)>. The 
data {x(k)} may be windowed or not, depending on the actual frame 
overlap (rectangular window is used for non-overlapping frames, while a 
Hanning window is used with a 50% overlap). 



From the above description it is clear that the present invention results in a sig- 
nificant noise reduction without sacrificing audible quality. This improvement may be 
explained by the separate power spectrum estimation methods used for speech and non- 
speech frames. These methods take advantage of the different characters of speech and 
non-speech (background noise) signals to minimize the variance of the respective power 
spectrum estimates 

• For non-speech frames Q v (u) is estimated by a non-parametric power spectrum 
estimation method, for example an FFT based periodogram estimation, which uses 
all the N samples of each frame. By retaining all the N degrees of freedom of the 
non-speech frame a larger variety of background noises may be modeled. Since the 
background noise is assumed to be stationary over several frames, a reduction of the 
variance of $ v (u>) may be obtained by averaging the power spectrum estimate over 
several non-speech frames. 

• For speech frames $ x (u>) is estimated by a parametric power spectrum estimation 
method based on a parametric model of speech. In this case the special character 
of speech is used to reduce the number of degrees of freedom (to the number of 
parameters in the parametric model) of the speech frame. A model based on fewer 
parameters reduces the variance of the power spectrum estimate. This approach is 
preferred for speech frames, since speech is assumed to be stationary only over a 
frame. 

It will be understood by those skilled in the art that various modifications and changes 
may be made to the present invention without departure from the spirit and scope thereof 
which is defined by the appended claims. 
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APPENDIX A 

ANALYSIS OF H MS (u) 

Paralleling the calculations for H M s(u>) gives 

*' m= ('-JS)*' m -*- m 



(27) 



where in the second equality, also the Taylor series expansion \ZT+"x ct 1 + x/2 is used. 
From (27) it follows that the expected value of *,(w) is non-zero, given by 



£[*.(*)] ~ 2*„(u,) ^1 - ^ (M) 



Further, 



Var($ a (u/)) ~ 

2 



(29) 



Combining (29) and (15) 



Var(*.(«)) s ( 1 - x | 1 + ^| J 7 *2M (30) 



APPENDIX B 



ANALYSIS OF H W f(v) 

In this Appendix, the PSD error is derived for speech enhancement based on Wiener 
filtering, [2]. In this case, H(u>) is given by 

= a , t^i , , = ( 31 > 
*,(u;) + *v(u;) 

Here, $ 9 {u) is an estimate of $ s {u>) and the second equality follows from 4> 5 (u;) = $ x (t*>)- 
4 v (u). Noting that 



a straightforward calculation gives 



x (-*.(«) + 2 {^A. (W) _A.( W) }) 



From (33), it follows that 



and 

2 



(32) 



(33) 



£[*,{»)) *-(l-%^* v (») 04) 



Var(*,(u,))~4(l-|^ 7 **(") (35) 
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APPENDIX C 

ANALYSIS OF H ML {u>) 

Characterizing the speech by a deterministic wave-form of unknown amplitude and 
phase, a maximum likelihood (ML) spectral subtraction method is defined by 



(36) 



(37) 



= ^(l + tfpsM) 

Inserting (11) into (36) a straightforward calculation gives 

u i\ !*-(«)/; A *(«).«.(w)A,(«)\«\ 

(I^a.m-a^)) 

where in the first equality the Taylor series expansion (1 + x) -1 ^ 1 - x and in the 
second \/\ 4-x ~ 1 + x/2 are used. Now, it is straightforward to calculate the PSD 
error. Inserting (37) into (9)- (10) gives, neglecting higher than first order deviations in 
the expansion of H\, L {y)) 

n(-jig)(^>-H 

From (38), it follows that 



(38) 



(39) 



where in the second equality (2) is used. Further, 

^^f'+S) (4o> 



WO 96/24128 



22 



PCT/SE96/00024 



APPENDIX D 

DERIVATION OF Hips(u) 

When and $ v (u>) are exactly known, the squared PSD error is minimized by 

Hpsfa), that is Hps(u>) with 4 x (u/) and *„(w) replaced by $ x (u) and $ w (u/), respectively. 
This fact follows directly from (9) and (10), viz. *,(u/) = [H 2 (u>)<P x (v) - $,(u/)p = 0, 
where (2) is used in the last equality. Note that in this case H (u>) is a deterministic quan- 
tity, while H{uj) is a stochastic quantity. Taking the uncertainty of the PSD estimates into 
account, this fact, in general, no longer holds true and in this Section a data-independent 
weighting function is derived in order to improve the performance of Hps(u). Towards 
- this end,- a variance expression of the form 

Var(* 5 (u;))-^7*v(^) (41) 

is considered (£ = 1 for PS and f = (1 - VT+SNR) 2 for MS and 7 = 7* + 7 V ) The 
variable 7 depends only on the PSD estimation method used and cannot be affected by 
the choice of transfer function H (u;). The first factor f , however, depends on the choice 
of H(uj). In this section, a data independent weighting function G(ut) is sought, such that 
H{v) = \Jg(uj) Hps{v) minimizes the expectation of the squared PSD error, that is 

G{u>) - arg min£[$ 5 (u;)] 2 

G{u>) 

(42) 

= G(u>)H 2 PS (u>)$ x (uj)-$ a (u>) 

In (42), G(u) is a generic weigthing function. Before we continue, note that if the weight- 
ing function G(u>) is allowed to be data dependent a general class of spectral subtraction 
techniques results, which includes as special cases many of the commonly used methods, 
for example, Magnitude Subtraction using G(u) = Blf S {tj)/Hp S (u;). This observation 
is, however, of little interest since the optimization of (42) with a data dependent G(u;) 
heavily depends on the form of G{u>). Thus the methods which use a data-dependent 
weighting function should be analyzed one-by-one, since no general results can be derived 
in such a case. 

In order to minimize (42), a straightforward calculation gives 



Taking expectation of the squared PSD error and using (41) gives 

E[*,{u)) 2 a (CM - lWM + <? 2 M 7 **M (44) 
Equation (44) is quadratic in G(w) and can be analytically minimized. The result reads, 

_ $*M 

G(a,) " •*(«) + 7 »5M (45) 

1 

i+ T r A» I (w)-*„(«)/ 

where in the second equality (2) is used. Not surprisingly, G(u) depends on the (unknown) 
PSDs and the variable 7. As noted above, one cannot directly replace the unknown 
PSDs in (45) with the corresponding estimates and claim that the resulting modified PS 
method is optimal, that is minimizes (42). However, it can be expected that, taking the 
uncertainty of *,M and *„M into account in the design procedure, the modified PS 
method will perform "better" than standard PS. Due to the above consideration, this 
modified PS method is denoted by Improved Power Subtraction (IPS). Before the IPS 
method is analyzed in APPENDIX E r the following remarks are in order. 

For high instantaneous SNR (for u such that *,M/*vM > 1) il follows from ( 45 > 
that. <5(u/) ~ 1 and, since the normalized error variance Var(* 4 M)/**M. see (41) is 
small in this case, it can be concluded that the performance of IPS is (very) close to the 
performance of the standard PS. On the other hand, for low instantaneous SNR (for u 
such that 7 **M » S*(c4). <?M * *2(o4/(7*vM)v leading to, cf. (43) 

£[*,M1 « "*.M (46) 

and 

3$ <47) 

However, in the low SNR it cannot be concluded that (46)-(47) are even approximately 
valid when <?M in (45) is replaced by 6(u>), that is replacing * X M and 4>„M in (45) 
with their estimated values * X M and 6„M, respectively. 



APPENDIX E 



ANALYSIS OF H IPS {u>) 

In this APPENDIX, the IPS method is analyzed. In view of (45), let G(u>) be defined 
by (45), with and $ x (u>) there replaced by the corresponding estimated quantities. 

It may be shown that 

*,(u/) ~ «5(«) - l)* a (u>) 

+GH (l^AxM - A„(u;)) (48) 

which can be compared with (43). Explicitly, 



£[*.(u;)] ~ (G(u) - l)*,(w) (49) 



and 



Var(*,(a,)) ~ G\u>) 



( *i x * / %*«(«) + 24>x(w)\ 2 - . 



(50) 



For high SNR, such that & 3 (lj)/& v (lj) » 1, some insight can be gained into (49)-(50). In 
this case, one can show that 

E[$ s (u>)) - 0 (51) 

and 

Var(*.(w)) c (l + *r|^ ) 7 *» (w) (52 > 

The neglected terms in (51) and (52) are of order 0((& v (uj)/Q s [u)) 2 ). Thus, as al- 
ready claimed, the performance of IPS is similar to the performance of the PS at high 
SNR. On the other hand, for low SNR (for u such that *J(w)/(7*5( w )) « G(w) ^ 
*5(«)/(7*5(«)). ^d 

£(*,(w)] * -*,(u>) (53) 
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and 



Comparing (53)-(54) with the corresponding PS results (13) and (16), it is seen that 
for low instantaneous SNR the IPS method significantly decrease the variance of ♦.(«) 
compared to the standard PS method by forcing 4.M in (9) towards zero. Explicitly, 
the ratio between the IPS and PS variances are of order 0(*<H/*<»). One may also 
compare (53)-(54) with the approximative expression (47), noting that the ratio between 
them equals 9. 



APPENDIX F 



PS WITH OPTIMAL SUBTRACTION FACTOR 6 

An often considered modification of the Power Subtraction method is to consider 



Hbpsiu) = 



where $(«;) is a possibly frequency dependent function. In particular, with 6{uj) = 6 for 
some constant 6 > 1, the method is often referred as Power Subtraction with oversub- 
traction. This modification significantly decreases the noise level and reduces the tonal 
artifacts. In addition, it significantly distorts the speech, which makes this modification 
useless for high quality speech enhancement. This fact is easily seen from (55) when * » 1. 
Thus, for moderate and low speech to noise ratios (in the w-domain) the expression under 
the root-sign is very often negative and the rectifying device will therefore set it to zero 
(half-wave rectification), which implies that only frequency bands where the SNR is high 
will appear in the output signal §{k) in (3). Due to the non-linear rectifying device the 
present analysis technique is not directly applicable in this case, and since 6 > 1 leads to 
an output with poor audible quality this modification is not further studied. 

However, an interesting case is when 6{u) < 1, which is seen from the following 
heuristical discussion. As stated previously, when $ x (uj) and $ v (u;) are exactly known. 
(55) with 6(u>) = 1 is optimal in the sence of minimizing the squared PSD error. On the 
other hand, when $> x (u/) and $ v (lj) are completely unknown, that is no estimates of them 
are available, the best one can do is to estimate the speech by the noisy measurement 
itself, that is s(k) — x(fc), corresponding to the use of (55) with 5 = 0. Due the above 
two extremes, one can expect that when the unknown $x(u;) and $ v (u/) are replaced by, 
respectively 4> x (u) and $ v (w), the error E[$ s (uj)) 2 is minimized for some 6(lj) in the 
interval 0 < 6{uj) < 1. 

In addition, in an empirical quantity, the averaged spectral distortion improvement, 
similar to the PSD error was experimentally studied with respect to the subtraction factor 
for MS. Based on several experiments, it was concluded that the optimal subtraction factor 
preferably should be in the interval that span from 0.5 to 0.9. 

Explicitly, calculating the PSD error in this case gives 

~ (1 - + 6(u) ^I^A^u,) - A„(u,) j 



27 PCT/SE96/00024 

WO 96/24128 



(56) 



Taking the expectation of the squared PSD error gives 

E [* a {u>)} 2 * (1 - 6(u,)) 2 * 2 M + 6 2 *r*lM ( 5? ) 
where (41) is used. Equation (57) is quadratic in 6(u>) and can be analytically minimized. 
Denoting the optimal value by 6, the result reads 

1+7 

Note that since 7 in (58) is approximately frequency independent (at least for N > 1) 
also I is independent* the frequency. In particular, I is independent of ♦,(«) and •„(«),- 
which implies that the variance and the bias of *.(«) directly follows from (57). 

The value of I may be considerably smaller than one in some (realistic) cases. For 
example, once again considering j. = 1/r and * = 1. Then I is given by 



21 + 1/2t 

which clearly, for all r is smaUer than 0.5. In this case, the fact that I « 1 indicates 
that the uncertainty in the PSD estimators (and, in particular, the uncertainty in ♦,(«)) 
have a large impact on the quality (in terms of PSD error) of the output. Especially, the 
use of 6 « 1 implies that the speech to noise ratio improvement, from input to output 

signals, is small. 

An arising question is that if there, similarly to the weighting function for the IPS 
method in APPENDIX D, exists a data independent weighting function G( W ). In AP- 
PENDIX G , such a method is derived (and denoted IPS); 
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APPENDIX G 

DERIVATION OF Hf, PS (u>) 

In this appendix, we seek a data independent weighting factor (7(w) such that H{u) = 
\f&&) H SPS {u,) for some constant 6 (0 < 6 < 1) minimizes the expectation of the squared 
PSD error, cf (42). A straightforward calculation gives 



*.(<•>) = (G(u>) - l)*,( w ) + G(u,)(l - «)*„(u>) 

The expectation of the squared PSD error is given by 

E[* a (u,)] 2 =(G( w ) - 1) 2 * 2 H+G 2 (u/)(1 - «) 2 * 2 (u,) 

2(G(u>) - l)* s ( w )G(u;)(l - «)$„M+G 2 (u;)* 2 7 *2(a,) 



(59) 



(60) 



The right hand side of (60) is quadratic in G(u) and can be analytically minimized. The 
result <5(w) is given by 

G(u,) = 

»gM + ».(*)».(«)(!-*) 

* 2 (w) + 2* a (w)* v (w)(l-6)-f(l_^)2$2( u; ) + 5 2 7 $2( w ) 

= 1 

where 0 in the second equality is given by 

_ (1 - S) 2 + g 2 7 + (l - 6)*,(u,)/<!> v (u,) 

H-(1-5)* V H/*,( W ) < 62 ) 

For 6=1, (61)-(62) above reduce to the IPS method, (45), and for 6 = 0 we end up 
with the standard PS. Replacing *,(u>) and *„(u/) in (61)-(62) with their corresponding 
estimated quantities * x (u>) - * v (u,) and *„(u>), respectively, give rise to a method, which 
in view of the IPS method, is denoted *IPS. The analysis of the 61PS method is similar to 
the analysis of the IPS method, but requires a lot of efforts and tedious straightforward 
calculations, and is therefore omitted. 
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CLAIMS 

1. A spectral subtraction noise suppression method in a frame based digital commu- 
nication system, each frame including a predermined number N of audio samples, thereby 
giving each frame N degrees of freedom, wherein a spectral subtraction function H(u) is 
based on an estimate & v (u>) of the power spectral density of background noise of non- 
speech frames and an estimate of the power spectral density of speech frames, 
characterized by: 

approximating each speech frame by a parametric model that reduces the number of 
degrees of freedom to less than N; and 

estimating said estimate * x (u;) of the power spectral density of each speech frame by 
a parametric power spectrum estimation method based on the approximative parametric 
model 

estimating said estimate $„(u>) of the power spectral density of each non-speech frame 
by a non-parametric power spectrum estimation method. 

2. The method of claim 1, characterized by said approximative parametric model 
being an autoregressive (AR) model. 

3. The method of claim 2, characterized by said autoregressive (AR) model being 
approximately of order y/N. 

4. The method of claim 3, characterized by said autoregressive (AR) model being 
approximately of order 10. 

5. The method of claim 3, characterized by a spectral subtraction function H(u) in 
accordance with the formula: 



where G(u>) is a weighting function and 6(u>) is a subtraction factor. 

6. The method of claim 5, characterized by G{uj) = 1. 

7. The method of claim 5 or 6, characterized by 6(u/) being a constant < 1. 

8. The method of claim 3, characterized by a spectral subtraction function H(u>) in 
accordance with the formula: 



H(u) = 1 - 



*,(«) 



9. The method of claim 3, characterized by a spectral subtraction function H(u>) i 
accordance with the formula: 

( 



■ 10. The method of claim 3, characterized by a spectral subtraction function H{u) in 
accordance with the formula: 
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